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Abstract 

The  Bunch-Kaufman  algorithm  is  the  method  of  choice  for  factor¬ 
ing  symmetric  indefinite  matrices  in  many  applications.  However,  the 
Bunch-Kaufman  algorithm  does  not  take  advantage  of  high-performance 
architectures  such  as  the  Cray  Y-MP.  Three  new  algorithms,  based 
on  Bunch-Kaufman  factorization,  that  take  advantage  of  such  archi¬ 
tectures  are  described.  Results  from  an  implementation  of  the  third 
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1  Introduction 

The  Bunch-Kaufman  algorithm  is  considered  one  of  the  best  methods  for 
factoring  full,  symmetric,  indefinite  matrices  [1],  [2].  A  modified  version  has 
been  successfully  used  to  factor  sparse,  indefinite  matrices  [3].  Recently, 
Runch-Keufman  factorization  has  been  shown  to  be  the  method  of  choice  for 
a  subset  of  banded,  symmetric  indefinite  matrices  [4], 

The  Bunch-Kaufman  algorithm  maintains  the  symmetry  of  the  matrix 
during  factorization  and  yields  the  inertia  of  the  matrix  essentially  for  free, 
an  important  consideration  for  eigenvr  lue  calculations  [1].  A  drawback  to  the 
Bunch-Kaufman  algorithm  is  its  unsuitability  for  high-performance  architec¬ 
tures.  Herein,  three  new  algorithms,  based  on  Bunch-Kaufman  factorization, 
are  given  for  architectures  such  as  the  Cray  Y-MP. 

The  technique  of  loop  unrolling  for  vector  architectures  is  discussed  in 
section  2.  In  reef  ion  3,  one  of  several  variations  of  the  Bunch-Kaufman  al¬ 
gorithm  is  reviewed  and  the  reason  for  its  unsuitability  for  high-performance 
architectures  is  given.  Three  new  algorithms  for  high-performance  architec¬ 
tures  are  developed  in  section  4.  Results  showing  the  benefits  of  the  third 
algorithm  are  given  in  section  5.  Finally,  a  summary  and  description  of  future 
wrork  is  given  in  section  6. 

2  Loop-Unrolling 

Loop  unrolling  is  a  well  known  technique  for  improving  performance  on  vector 
architectures.  A  loop  is  unrolled  by  restructuring  it  to  allow  more  compu¬ 
tation  to  take  place  at  each  step.  A  simple  example  of  loop  unrolling  from 
[5]  is  given  in  Figure  1.  The  outer  DO-loop  has  been  unrolled  to  a  depth  of 
four.  In  the  original  program  segment,  three  vector  memory  references  were 
required  for  every  two  vector  floating  point  operations.  The  ratio  for  the  un¬ 
rolled  program  segment  is  six  vector  memory  references  for  every  eight  vector 
floating  point  operations.  A  significant  decrease  in  the  number  of  memory 
references  has  been  achieved. 

The  reduction  in  the  number  of  vector  memory  operations  reduces  the 
probability  of  delays  due  to  memory  latency  times  as  well  as  the  possibility 
of  memory  contention  in  a  parallel  computer  [6]. 

other  benefits  of  loop  unrolling  are  described  in  [71-  The  first  is  a 
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C  Original  program  segment 
DO  20  J  =  1,  N2 
DO  10  I  =  1,  N1 
Y(I)  =  Y(I)  +  X(J)  *  M(I,J) 

10  CONTINUE 

on  p/^XTrnTMTT-r 

0^*1  i.A»«  U  J-/ 

C  In  this  example,  the  end  condition  if  N2  isn’t  a  multiple  of  four  is  ignored 
DO  20  J  =  4,  N2,  4 
DO  10  I  =  1,  N1 

Y(I)  =  Y(I)  +  X(J-3)  *  M(I,J-3)  +  X(J-2)  *  M(I,J-2)  + 
c  X(J-l)  *  M(I,J-1)  +  X(J)  *  M(I,J) 

10  CONTINUE 

20  CONTINUE 

Figure  1:  Simple  loop  unrolling  example 

reduction  in  loop  overhead  because  fewer  incrementing  and  testing  operations 
are  required.  This  benefit  can  be  reaped  by  any  computer  architecture. 

For  computers  with  segmented  functional  units,  such  as  the  CDC  7600, 
the  higher  ratio  of  floating  point  operations  to  overhead  operations  will  allow 
a  higher  level  of  concurrency  within  a  functional  unit. 

Computers  with  independent  functional  units,  such  as  the  Cray-1,  benefit 
from  greater  concurrency  between  the  functional  units. 

The  optimal  depth  of  loop  unrolling  is  largely  dependent  on  the  target 
architecture.  For  example,  if  the  independ'  .  jnctional  units  of  a  cnmp"lpT 
are  kept  busy  with  loop  unrolling  of  depth  en  increasing  the  depth  to 
p  +  1  will  not  result  in  increased  concurrency  among  functional  units. 

In  the  simple  example  in  Figure  1,  the  results  of  iteration  j  of  the  outer 
loop  did  not  depend  on  results  of  previous  iterations.  Therefore,  the  outer 
loop  was  easily  unrolled.  If  LDLT  decomposition  is  considered,  however,  each 
iteration  of  the  outer  loop  depends  on  the  previous  iterations  (see  Figure  2). 
Unrolling  the  outer  locp  causes  several  pivot  columns  to  be  used  simultane¬ 
ously  to  update  the  remaining  non-zeroes.  For  the  algorithm  to  be  correct, 
the  first  pivot  column  must  be  used  to  update  the  other  pivot  columns,  then 
the  second  pivot  column  used  to  update  the  remaining  pivot  columns,  and  so 
forth.  After  all  the  pivot  columns  are  updated,  they  are  used  to  update  the 
remaining  non-zeroes.  Conceptually,  loop  unrolling  in  LDLT  allows  the  use 
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1)  DO  10  I  =  1,  N 

C  Compute  the  pivot  column 

2)  DO  20  J  =  1+1,  N 

3)  V(J)  =  A(J,I) 

4)  A(J  ,1)  =  A(J,I)/A(I,I) 

5)  20  CONTINUE 

C  Update  the  remaining  non-zeroes 

6)  DO  30  J  =  1+1,  N 

7)  DO  40  K  =  J,  N 

8)  A(J,K)  =  A(J,K)  -  V(K)*A(J,I) 

9)  40  CONTINUE 

10)  30  CONTINUE 

11)  10  CONTINUE 


Figure  2:  The  LDLT  algorithm 


of  matrix-matrix  operations  rather  than  matrix-vector  operations.  A  version 
of  LDLt  unrolled  to  a  depth  of  three  is  given  in  Figure  3. 

Because  three  pivot  columns  are  used  to  update  the  remaining  non-zeroes 
in  step  20,  each  time  an  element,  is  fetched  six  floating  point  computa¬ 
tions  are  done,  rather  than  just  two  as  in  step  8  of  the  original  algorithm. 


3  The  Bunch-Kaufman  Algorithm 

The  Bunch-Kaufman  algorithm  factors  A,  an  n  x  n  real  symmetric  indefinite 
matrix,  into  LDLT  while  doing  symmetric  permutations  on  A  to  maintain 
stability,  resulting  in  the  following  equation: 

PAPT  =  LDLt.  (1) 

Although  several  variations  of  the  algorithm  exist,  the  focus  here  is  on  algo¬ 
rithm  D  from  [1]  because  it  is  the  simplest  to  discuss.  The  methods  described 
in  section  4  are  also  applicable  to  Algorithm  A  described  in  [1],  but  not  to 
Algorithm  C  (Algorithm  B  is  mentioned  in  [1],  but  it  is  not  described  in 
detail). 

The  Bunch-Kaufman  algorithm  maintains  stability  by  using  2x2  pivots 
combined  with  symmetric  permutations  to  A  when  a  lxl  pivot  is  not  stable. 


3 


C  Loop  unrolled  version 

C  The  end  condition  for  handling  N  not  divisible  by  3  has  been  ignored 

1)  DO  10  I  =  1,  N,  3 

C  Compute  the  upper  3x3  triangle  of  the  pivot  columns 

2)  V1(I+1)  =  A(I+1,I) 

3)  A(!4  1,1)  =  A(I+1,I)/A(I,I) 

4)  Vl(I+2)  =  A(I+2.I) 

')  A(I-f  2,1)  =  A(I+2,I)/A(I,I) 

6)  A(I+l,I+l)  =  A(I+1,I+1)  -  V1(I+1)*A(I+1,I) 

7)  V2(l4  2)  =  A(I+2,I+1)  -  V1(I+1)*A(I+2,I) 

8)  A(I+2,T+1)  -  V2(I+2)/A(I+l,I'  1) 

9)  A(I+2,l4-2)  =  A(I+2,I+2)  -  V1(I  f  2)*A(I+2,I)  -  V2(I+2)*A(I+2,I+1) 
C  Update  and  compute  all  three  pivot  columns 

10)  DO  20  J  =  1+3,  N 

11)  V1(J)  =  A(J,I) 

12)  A(J,I)  =  A(J,I)/A(I,I) 

13)  V2(J)  =  A(J,I+1)  -  V1(I+1)*A(J,I) 

14)  A(J,I+1)  =  V2(J)/A(I+1,I+1) 

13)  V3(J)  =  A(J,I+2)  -  V1(I+2)*A(J,I)  -  V2(I+2)*A(J,I+1) 

16)  A(J,I+2)  =  A:j,I+2)/A(I+2,I+2) 

17)  20  CONTINUE 

C  Use  the  3  pivots  columns  to  update  the  remaining  non-zeroes 

18)  DO  30  J  =  1+3,  N 

9)  DO  40  K  =  J,  N 

20)  A(J,K)  =  A(J,K)  -  V1(K)*A(J,I)  -  V2(K)*A(J,I+1)  - 

C  V3(K)*  A(J,I+2) 

21)  40  CONTINUE 

22)  30  CONTINUE 

23)  10  CONTINUE 

Figure  3:  LDLT  unrolled  to  a  depth  of  three 


Because  this  paper  will  concentrate  on  ]  xl  pivots,  only  stability  for  these 
pivots  will  be  discussed  in  detail.  A  lxl  pivot  for  element  ahJ  at  step  k  takes 

the  form 

ai,j  =  ax,J  —  ai,kaj,k/ Q-k,k-  (2) 

Let  Hk  be  the  maximum  of  the  absolute  values  of  the  uneliminated  elements 
at  step  k.  Step  2  of  the  algorithm  (shown  in  Figure  4)  finds  the  maximum 
element,  A,  in  the  pivot  column.  By  substituting  p  and  A  into  equation  2, 
the  bound  on  pt+1  becomes 

Afc+i  <  Mfc  +  I  ak,k  |<  Pt(  1  +  A/  |  akk  |).  (3) 

Step  4  ensures  that  a  lxl  pivot  occurs  if  a  <j  ati,  |  /A,  where  the  parameter 
a  has  been  chosen  to  be  0.525  to  maximize  stability  for  Algorithm  D  [1],  By 
substituting  a  into  equation  3, 

Mk+i  <  Afc(l  +  !/<*)•  (4) 

Therefore,  the  bound  on  the  growth  of  an  element  due  to  a  lxl  pivot  is  2.905. 

If  the  test  in  step  4  is  failed,  a  subsequent  row  search  and  another  stability 
test  determines  if  a  2x2  pivot  and  a  permutation  are  necessary. 

The  stability  checks  and  possible  permutations  at  each  step  of  the  Bunch- 
Kaufman  algorithm  prevent  the  use  of  the  same  type  of  loop  unrolling  that 
is  used  for  LDLT  decomposition.  Because  the  stability  checks  and  permuta¬ 
tions  must  be  completed  before  a  pivot  column  is  computed,  pivot  columns 
cannot  be  grouped  as  they  were  in  Figure  2  without  invalidating  the  bounds 
on  element  growth. 


4  New  Algorithm 

This  section  will  develop  three  ways  in  which  the  Bunch-Kaufman  algorithm 
can  be  modified  to  allow  pivot  columns  to  be  grouped  together  in  one  step. 
Because  each  2x2  pivot  involves  a  permutation  of  A,  it  is  not  possible  to  group 
2x2  pivots  together.  However,  lxl  pivots  can  be  grouped  with  a  2x2  pivot  if 
they  follow  the  2x2  pivot,  allowing  the  permutation  to  precede  the  updating 
and  computing  of  pivot  columns.  Each  2x2  pivot  can  be  implemented  using 
loop  unrolling  of  depth  2.  The  general  strategy  in  this  section  will  be  to 
try  to  group  several  lxl  pivots  into  a  single  step  in  a  stable  fashion.  The 


5 


1)  for  i  —  l,n 

begin 

2)  X  ~  m— Xj— n  |  a,j{  | 

3)  set  r  to  the  row  number  of  A 

4)  if  A  a  <  j  a,it.  |  then 
begin 

5)  perform  a  lxl  pivot 
end 

else 

begin 

6)  o  =  maxJ=i+iin  |  <ir,j  | 

7)  if  a  A2  <  a  |  a,i:i  |  then 
begin 

8)  perform  a  lxl  pivot 

end 

else 

begin 

9)  exchange  rows  and  columns  r  and  i+  1 

10)  perform  a  2x2  pivot 

11)  i  =  i  +  1 
end 

end 

12)  end 

13)  if  inertia  is  desired,  then  scan  the  D  matrix 

Figure  4:  The  Bunch-Kaufman  Factorization  Algorithm 


6 


strategies  described  in  this  section  are  applicable  to  Algorithms  A  and  D, 
but  not  C  from  [1  ] .  Algorithm  C  cannot  be  unrolled  using  these  strategies 
because  a  permutation  occurs  at  every  step. 

The  first  approach  uses  pxp  pivots  in  much  the  same  way  as  the  2x2  pivots 
in  the  Bunch-Kaufman  algorithm.  The  update  of  a  single  element  of  A  using 
a  2x2  pivot  at  step  k  is 

{ax,kak+l,k+l  ~  ax.k+lak+l,k)dj,k  +  ( ax,k+lak,k  ~  0.x,kak+l,k)aj,k+\ 

—  ax,j  2  • 

(5) 

To  obtain  a  bound  for  element  growth,  first  define  at  step  k 

Ai  =  77icxi=fc+1,n  |  altk  |  (6) 


A2  —  maxi=lc4-2.n  \  Gl,k  + 1  I  • 

If  pk,  Ai,  and  A2  are  substituted  into  equation  5,  then 

^  +  ^2  +  2AxA2 


llk+ 2  5;  Hk{  1  + 


l,i+l  ai+l .» 


The  bound  on  growth  of  pk+2  f°r  a  2x2  pivot  in  Algorithm  D  is  8.52b  [1], 
Therefore,  a  2x2  pivot  can  be  performed  if 

1  +  t - (Al  ■— 2J_.? - .  <  8.526.  (9) 

I  ai,iai+l,i+l  ai+l,l  I 

This  derivation  is  similar  to  the  2x2  pivot  stability  analysis  in  [1],  This  test 
differs  from  the  Bunch-Kaufman  2x2  test  because  it  groups  two  potential 
lxl  pivots  into  one  step.  The  Bunch-Kaufman  2x2  test  is  used  when  a  lxl 
pivot  is  not  stable.  The  new  test  precedes  step  4  of  the  Bunch-Kaufman 
algorithm  in  Figure  4.  This  approach  has  two  drawbacks:  1)  the  formulae 
for  bounding  the  growth  due  to  a  pxp  pivot  become  increasingly  complicated 
as  p  increases,  and  2)  the  inertia  calculation  is  no  longer  just  a  search  down 
the  diagonal,  it  requires  the  solution  of  many  pxp  eigenproblems. 

The  second  approach  updates  the  potential  pivot  columns  one  at  a  time 
and,  after  each  update,  applies  the  lxl  pivot  stability  check  to  determine  if 


4a)  if  A  a  <  !  axx  j  then 
begin 

4b)  compute  the  zth  pivot  column  and  use  it  to  update  column  i  +  1 

4c)  A2  =  maxJ=t+2:TX  |  aJit+1  j 

4d)  if  A2  a  >|  a,4-ilt+ 1  j  then 
begin 

4e)  compute  the  (z  4-  l)th  pivot  column  and  use  it  and  column  i  to 
update  column  i  f  2 

4f)  A3  =  max;=1+3,n  |  cJiX+2  | 

4g)  if  A3  a  >|  a,  +  2>,+2  |  then 
begin 

4h)  use  columns  i,  i  +  1,  and  i  +  2  to  update  the  remaining 

non-zeroes 
end 
else 
begin 

4i)  use  columns  i  and  z  +  1  to  update  the  remaining  non-zcucs 

end 
end 
else 
begin 

4j)  perform  a  lxl  pivot 
end 
end 


Figure  5:  Approach  Two  for  Loop  Unrolling 
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further  unrolling  is  possible.  This  test  replaces  steps  4  and  5  of  the  Bunch- 
Kaufman  algorithm.  An  example  of  this  test  for  up  to  three  columns  is  given 
in  Figure  5. 

The  third  strategy  uses  an  a  priori  approach  to  predict  stability.  The 
strategy  predicts  the  stability  of  grouping  p  lxl  pivots  without  updating 
each  potential  pivot  column.  Only  the  upper  pxp  triangle  must  be  updated 
to  bound  element  growth.  From  equation  4,  for  p  successive  lxl  pivots  to 
maintain  stability,  the  maximum  element  growth  must  be  bounded  by 

( 1  4-  1  /  q)p.  (10) 


A  t 


:tep  k 


•'2  mQ-Xj-lc  f2,n  :  1 


00 


The  bound  on  element  growth  for  a  lxl  pivot  is  (1  +  X/nkk),  therefore  the 
bound  on  ,\2  after  a  lxl  pivot  is 


A  2  <  A2(  1  -T-  Xj ak  k). 


(12) 


Because  the  upper  pxp  triangle  has  been  updated,  a  bound  on  fikt-2  for  a 
second  lxl  pivot  using  column  hr  1  is 

Pk  +  2  T  /i*:  m  ( 1  +  X2I (Ik+I.ki  1  )•  (13) 

By  substituting  the  bound  for  //*. r  1  into  equation  13 

Ffc f 2  0  Pk{  1  1  A/a*.,fc)(l  +  A2 /afc  +  i,jt+ 1  )■  (14) 


In  general,  the  bound  on  pk  +  P  1  for  p  lxl  pivots  is 

P  A:  •-  p  2  X  k 4- p—  1 
Qfcf  p-1, to  p-1 


pk  f  p-  1  T  P k  *  p  -  2  (  1 


(15) 


where 


Afcup-i  ~  Tnox j  k |  p n  ^  (ij'k-rp-i  !  ■  (If!) 

Given  the  bound  for  p-1  lxl  pivots,  the  only  new  information  necessary 
is  the  updated  value  of  a/tfp-i.fct  v ->  ancl  ^t+p-i-  If  the  bound  for  /iq(p_i) 
is  small  enough  then  the  p  pivot  columns  arc  computed  at  the  same  time 
and  all  used  at  the  same  time  to  update  the  remaining  non-zeroes.  The  a 
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■priori  strategy  has  two  advantages  over  the  second  strategy.  First,  it  allows 
all  the  pivot  columns  to  be  updated  and  computed  in  one  loop.  The  second 
strategy  requires  the  pivot  columns  to  be  updated  and  computed  one  after 
the  other.  Therefore,  the  a  prion  method  reaps  the  benefits  of  loop  unrolling 
in  the  pivot  column  calculation.  Second,  the  a  priori  method  can  combine  a 
pivot  that  fails  the  Ixl  pivot  test  with  a  lxl  pivot  that  is  very  stable  if  the 
combination  of  the  two  meets  the  stability  criterion.  The  second  method  is 
unable  to  combine  ihe  two  pivots  if  one  of  the  pivots  fails  the  lxl  stability 
test.  Another  benefit  the  a  prion  method  reaps  from  this  combination  is 
avoiding  the  search  for  er  in  step  5  of  the  Bunch-Kaufmaa  algorithm.  A 
disadvantage  of  the  a  pnon  method  w’hen  compared  wdth  the  second  method, 
is  the  use  of  estimated  bounds.  In  some  cases  these  bounds  are  too  pessimistic 
and  thereby  prevent  the  combining  of  lxl  pivots  when  the  combination  is 
stable.  Because  the  second  method  actually  computes  the  pivot  columns  it 
does  not  have  to  use  estimated  bounds. 

5  Results 

A  version  of  the  a  priori  algorithm  described  in  section  4  suitable  for  variable 
banded  systems  was  implemented  on  a  Cray  Y-MP.  The  uniprocessor  imple¬ 
mentation  allows  loon  unrolling  up  to  depth  six  to  take  place.  When  the 
maximum  depth  of  loop  unrolling  is  fixed  at  one,  this  algorithm  is  identical 
to  the  Bunch-Kaufman  algorithm.  The  Cray  Y-MP  is  a  register-to-register 
parallel/ vector  computer  with  up  to  eight  processors.  Each  processor  has 
independent,  segmented  functional  units.  An  indefinite  matrix  that  arose 
from  an  application  in  structural  engineering  was  factored  and  the  resulting 
triangular  matrices  were  solved.  The  order  of  the  matrix  was  10,785  and  the 
average  bandwidth  was  416.  A  significant  reduction  in  factorization  time  and 
solution  time  for  the  resulting  triangular  systems,  due  to  increasing  depths 
of  loop  unrolling  is  shown  in  Figure  6.  For  this  matrix,  the  combination  of 
six  pivot  columns  never  met  the  stability  criterion. 

The  benefits  of  the  algorithm  will  vary  depending  on  the  architecture 
and  on  the  matrix  being  solved.  For  example,  the  same  implementation  was 
executed  on  the  Convex  C-220,  an  architecture  with  independent,  segmented 
functional  units.  An  indefinite  matrix  arising  from  the  same  application  was 
factored,  but  with  an  order  cf  6734  and  an  average  bandwidth  of  336.  The 


10 


Depth 

Factorization 
Time  (sec.) 

Speedup  over 
Bunch-Kaufman 

Triangular  Solution 
Time  (sec.) 

Speedup  over 
Bunch-Kaufman 

1 

16.3 

1.00 

0.32 

2 

13.4 

1.22 

0.23 

1.39 

3 

12.9 

1.26 

0.21 

1.52 

4 

12.6 

1.29 

0.21 

1.52 

5 

12.8 

1.27 

0.20 

1.60 

Figure  6:  Effects  of  different  depths  of  loop  unrolling  on  the  Cray  Y-MP 


Depth 

Factorization 
Time  (sec.) 

Speedup  over 
Bunch-Kaufman 

Triangular  Solution 
Time  (sec.) 

Speedup  over 
Bunch-Kaufman 

1 

71.77 

1.00 

1.30 

1.00 

2 

1.41 

1.05 

1.24 

3 

46.85 

1.53 

0.98 

1.33 

4 

45.37 

1.58 

0.96 

1.35 

5 

45.15 

1.59 

0.97 

1.34 

Figure  7:  Effects  of  different  depths  of  loop  unrolling  on  the  Convex  C-220 

maximum  speedup  due  to  the  a  priori  algorithm  is  shown  in  Figure  7  to  be 
significantly  better  for  this  combination  of  architecture  and  matrix  than  for 
the  combination  in  Figure  6. 

Observing  the  number  of  times  each  depth  of  loop  unrolling  is  utilized 
can  be  useful  in  determining  the  best  maximum  depth  of  loop  unrolling  for 
a  particular  application.  Shown  in  Figure  8  is  the  number  of  times  that 
each  depth  of  loop  unrolling  was  utilized  when  factoring  the  same  matrix 
used  in  the  Convex  C-220  test.  These  results  will,  of  course,  be  different  for 
every  matrix  factored,  but  may  be  similar  for  matrices  arising  from  the  same 
application. 


Maximum 
Possible  Depth 

No.  of 
Depth  1 

No.  of 
Depth  2 

No.  of 
Depth  3 

No.  of 
Depth  4 

No.  of 
Depth  5 

No.  of 
Depth  6 

1 

6732 

1 

- 

- 

- 

- 

2 

658 

3038 

- 

- 

- 

- 

3 

516 

1195 

1  1 

- 

- 

- 

4 

543 

992 

13 

730 

- 

- 

5 

561 

874 

439 

432 

276 

- 

6 

561 

874 

439 

432 

276 

0 

Figure  8:  Depths  of  loop  "nrolled  achieved 

6  Summary  and  Future  Work 

Three  algorithms,  each  based  on  the  Bunch-Kaufman  algorithm,  suitable  for 
factoring  symmetric  indefinite  matrices  on  high-performance  architectures 
were  given.  The  third  algorithm,  called  the  a  priori  strategy,  was  deter¬ 
mined  to  be  superior  to  the  other  two  and  was  implemented  on  two  high- 
performance  architectures,  the  Cray  Y-MP  and  the  Convex  C-220.  The  a  pri¬ 
on  algorithm  was  shown  to  be  significantly  faster  than  the  Bunch-Kaufman 
algorithm. 

The  a  pnon  algorithm  is  also  suitable  for  implementation  on  parallel 
architectures  because  it  allows  the  use  of  matrix-matrix  operations,  rather 
than  the  matrix-vector  operations  used  in  the  Bunch-Kaufman  algorithm. 
The  use  of  the  a  priori  algorithm  will  increase  the  ratio  of  computation  to 
synchronization.  The  authors  are  currently  implementing  this  algorithm  on 
a  parallel  architecture. 

Another  possible  application  for  the  a  priori  strategy  is  the  factorization 
of  sparse  matrices.  A  variant  of  the  Bunch-Kaufman  algorithm  that  includes 
pivoting  to  preserve  sparsity  is  given  in  [3].  A  modified  version  of  the  a  priori 
strategy  may  improve  the  performance  of  this  algorithm  on  high-performance 
architectures. 
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